The white wine dataset contains 4898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). The question of interest in this analysis is: which chemical properties influence the quality of white wines? I will do the analysis in this sequence: univariate analysis -> bivariate analysis -> multivariate analysis -> final plots -> reflection.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
## [1] 4898 13
The data set has 4898 observations and 13 variables. X appears to be a row index (it runs from 1 to 4898). All 11 chemical variables are numeric. The dependent variable, quality, is an integer.
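The inspection output above can be produced with a handful of base R calls. The following is a minimal sketch, not the original code: it rebuilds only the first five rows of four columns from the `str()` listing so that it runs without the data file, and the name `wq_head` is made up for illustration.

```r
# First five rows of four columns, copied from the str() output above.
# The real data frame has 4898 rows and 13 columns.
wq_head <- data.frame(
  fixed.acidity    = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity = c(0.27, 0.30, 0.28, 0.23, 0.23),
  alcohol          = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality          = c(6L, 6L, 6L, 6L, 6L)
)

names(wq_head)           # column names
str(wq_head)             # type and first values of each column
print(summary(wq_head))  # min, quartiles, mean and max per column
print(dim(wq_head))      # number of rows and columns

# Column types: chemical measurements are numeric, quality is integer
col_types <- sapply(wq_head, class)
print(col_types)
```

On the full data frame the same calls would report 4898 rows and 13 columns, as shown in the output above.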
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
The distribution of quality is close to normal. Quality scores range from 3 to 9, and most fall between 5 and 7. The median score is 6 and the mean is 5.88.
The fixed.acidity distribution is approximately normal. Most fixed.acidity values fall between 6 and 9.
The distribution of volatile.acidity is approximately normal. Most values fall in [0.1, 0.5]. There are a few outliers above 0.6.
The residual.sugar distribution is skewed, so I use a log10 transformation to see the distribution better.
The log10(residual.sugar) distribution is bimodal.
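The log10 transformation described above can be sketched as follows. This is an illustration under assumptions: it uses only the first ten residual.sugar values shown in the `str()` output, and base R `hist()` rather than whichever plotting function produced the original figures.

```r
# First ten residual.sugar values, copied from the str() output;
# the full column has 4898 values.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# log10 compresses the long right tail of a skewed, strictly positive variable
log_sugar <- log10(residual_sugar)

# Histogram bins on the raw and log10 scales (plot = FALSE returns the bins
# without drawing; drop it to see the plots)
h_raw <- hist(residual_sugar, plot = FALSE)
h_log <- hist(log_sugar, plot = FALSE)

print(range(residual_sugar))
print(range(log_sugar))
```

The transformation only works because every residual.sugar value is positive (the minimum in the summary is 0.6).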
The distribution of chlorides is approximately normal. Most values fall in [0, 0.1]. There are a few outliers above 0.1.
The distribution of pH is approximately normal. The median pH is 3.18.
The distribution of alcohol is multimodal. Alcohol ranges over [8, 14.2].
The distribution of citric.acid is very close to normal, but interestingly there is a sharp spike at 0.5. I am curious why this happens, so in the next section I will draw the density plot for each quality level to see what happened.
The free.sulfur.dioxide distribution is approximately normal. Most values fall in [0, 100]. There are some outliers above 100.
The total.sulfur.dioxide distribution is also approximately normal. Most values fall in [0, 250]. There are some outliers above 300.
The distribution of density is bimodal. Most values fall in [0.99, 1]. There are a few outliers above 1, up to the maximum of 1.039.
The distribution of sulphates is approximately normal. Most values fall in [0.4, 0.6].
There are 4,898 white wines in the dataset with 11 attributes (fixed.acidity, volatile.acidity, chlorides, pH, citric.acid, residual.sugar, free.sulfur.dioxide, total.sulfur.dioxide, density, sulphates, alcohol). All of these attributes are continuous variables.
Quality scores in the dataset range from 3 to 9. The higher the score, the better the quality. Most quality scores fall between 5 and 7. The median score is 6 and the mean is 5.88.
The distributions of fixed.acidity, volatile.acidity, chlorides, pH, citric.acid, free.sulfur.dioxide, total.sulfur.dioxide and sulphates are normal or close to normal.
I’d like to determine which chemical properties influence the quality of white wines. I have no prior idea which variables matter most, so in the next section I will first print the correlation table.
No, I didn’t.
Yes. Residual.sugar is skewed, so I applied a log10 transformation to see the distribution better, and I found that log10(residual.sugar) is bimodal. The distribution of citric.acid is very close to normal, but interestingly there is a sharp spike at 0.5. I am curious why this happens, so in the next section I will draw the density plot for each quality level to see what happened.
It seems that quality is correlated with alcohol and density, then with chlorides, total.sulfur.dioxide and volatile.acidity. There are also strong correlations between the independent variables: alcohol & density r = -0.78; alcohol & residual.sugar r = -0.45; density & residual.sugar r = 0.84; density & total.sulfur.dioxide r = 0.53; total.sulfur.dioxide & free.sulfur.dioxide r = 0.62.
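A correlation table like the one summarised above can be computed with `cor()`. The sketch below runs on synthetic stand-in data, not the wine data; the data frame `toy` and the coefficients used to simulate it are assumptions chosen only so that the signs resemble those reported here.

```r
set.seed(1)
n <- 200
# Synthetic stand-in data (NOT the wine data): density is built to fall as
# alcohol rises, and quality to rise with alcohol.
alcohol <- runif(n, 8, 14)
density <- 1.02 - 0.0025 * alcohol + rnorm(n, sd = 0.002)
quality <- round(2.5 + 0.3 * alcohol + rnorm(n, sd = 0.8))
toy <- data.frame(alcohol, density, quality)

# Pairwise Pearson correlations (the default method of cor())
cor_mat <- cor(toy)
print(round(cor_mat, 2))

# Rank predictors by the absolute value of their correlation with quality
q_cor <- cor_mat["quality", c("alcohol", "density")]
print(sort(abs(q_cor), decreasing = TRUE))
```

Sorting by absolute value matters because a strong negative correlation is as informative as a strong positive one.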
##
## Pearson's product-moment correlation
##
## data: wq$density and wq$alcohol
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7908646 -0.7689315
## sample estimates:
## cor
## -0.7801376
There is a clear linear relationship between density and alcohol. As density increases, alcohol decreases. Pearson’s r = -0.780.
##
## Pearson's product-moment correlation
##
## data: wq$residual.sugar and wq$alcohol
## t = -35.321, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4726723 -0.4280267
## sample estimates:
## cor
## -0.4506312
We can see a relationship between residual.sugar and alcohol. Wines with higher residual.sugar have lower alcohol. Pearson’s r = -0.45.
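The test output above comes from `cor.test()`. A minimal sketch of the call follows, using only the first ten rows of the two columns as printed in the `str()` output; ten rows are enough to show the mechanics but will not reproduce the full-data estimate of -0.45.

```r
# First ten rows of two columns, copied from the str() output.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
alcohol        <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)

# Pearson's product-moment correlation with a two-sided test of r = 0
ct <- cor.test(residual_sugar, alcohol, method = "pearson")

r  <- unname(ct$estimate)  # sample correlation
ci <- ct$conf.int          # 95 percent confidence interval
print(r)
print(ci)
```

The sign of `r` carries the direction of the relationship, so it should be reported along with the magnitude.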
##
## Pearson's product-moment correlation
##
## data: wq$residual.sugar and wq$density
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8304732 0.8470698
## sample estimates:
## cor
## 0.8389665
There is a clear linear relationship between density and residual.sugar. As residual.sugar increases, density increases. Pearson’s r = 0.839.
##
## Pearson's product-moment correlation
##
## data: wq$total.sulfur.dioxide and wq$density
## t = 43.719, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5094349 0.5497297
## sample estimates:
## cor
## 0.5298813
We can see a linear relationship between density and total.sulfur.dioxide. As total.sulfur.dioxide increases, density increases. Pearson’s r = 0.530.
##
## Pearson's product-moment correlation
##
## data: wq$total.sulfur.dioxide and wq$free.sulfur.dioxide
## t = 54.645, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5977994 0.6326026
## sample estimates:
## cor
## 0.615501
We can see a linear relationship between total.sulfur.dioxide and free.sulfur.dioxide. As total.sulfur.dioxide increases, free.sulfur.dioxide increases. Pearson’s r = 0.616.
##
## Pearson's product-moment correlation
##
## data: wq$alcohol and wq$quality
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4126015 0.4579941
## sample estimates:
## cor
## 0.4355747
For the lower quality levels 3 to 5, the median alcohol decreases as quality increases. For quality 6 to 9, the median alcohol increases with quality. The highest quality has the highest median alcohol, and quality 5 has the lowest. Pearson’s r between quality and alcohol is 0.436.
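The per-level medians discussed above are the centre lines of a boxplot of alcohol by quality. A sketch of how they can be computed, on synthetic stand-in data (not the wine data; the simulated trend is an assumption for illustration):

```r
set.seed(2)
n <- 300
# Synthetic stand-in data (NOT the wine data)
quality <- sample(4:8, n, replace = TRUE)
alcohol <- 9 + 0.4 * (quality - 4) + rnorm(n, sd = 1)

# Median alcohol within each quality level
med_by_quality <- tapply(alcohol, quality, median)
print(round(med_by_quality, 2))

# boxplot() computes the same medians; plot = FALSE returns the statistics
# without drawing (drop it to see the plot)
bp <- boxplot(alcohol ~ quality, plot = FALSE)
print(bp$stats[3, ])  # row 3 of stats holds the medians
```

Printing the medians next to the plot makes non-monotone patterns, such as a dip at one quality level, easier to verify than reading them off the boxes.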
##
## Pearson's product-moment correlation
##
## data: wq$density and wq$quality
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3322718 -0.2815385
## sample estimates:
## cor
## -0.3071233
The highest quality has the lowest median density. As quality increases, the median density decreases. Pearson’s r between quality and density is -0.307.
##
## Pearson's product-moment correlation
##
## data: wq$chlorides and wq$quality
## t = -15.024, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2365501 -0.1830039
## sample estimates:
## cor
## -0.2099344
The best quality has the lowest chlorides, and as quality increases, chlorides decrease. Pearson’s r between quality and chlorides is -0.210.
##
## Pearson's product-moment correlation
##
## data: wq$total.sulfur.dioxide and wq$quality
## t = -12.418, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2017563 -0.1474524
## sample estimates:
## cor
## -0.1747372
Wines with the highest total.sulfur.dioxide tend to have the lowest quality. Quality seems to rise as total.sulfur.dioxide falls. Pearson’s r between quality and total.sulfur.dioxide is -0.175.
##
## Pearson's product-moment correlation
##
## data: wq$volatile.acidity and wq$quality
## t = -13.891, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2215214 -0.1676307
## sample estimates:
## cor
## -0.194723
The low quality levels 3 and 4 have the largest variance in volatile.acidity. For quality 5 and 6, the variances are the smallest. For quality higher than 6, the medians are very close. For the lower quality levels 4 and 5, the medians are slightly larger. Pearson’s r between quality and volatile.acidity is -0.195.
Quality correlates with alcohol (r = 0.44) and density (r = -0.31), then with chlorides (r = -0.21), volatile.acidity (r = -0.19) and total.sulfur.dioxide (r = -0.17).
Wines with higher alcohol are more likely to have high quality. Wines with lower density are more likely to have higher quality. Higher quality wines seem to have lower chlorides and lower total.sulfur.dioxide.
Alcohol correlates strongly with density: as density increases, alcohol decreases. Alcohol also correlates with residual.sugar: wines with higher residual sugar have lower alcohol.
Density has a linear relationship with residual.sugar: as residual.sugar increases, density increases. Density also correlates with total.sulfur.dioxide: as total.sulfur.dioxide increases, density increases.
Total.sulfur.dioxide correlates with free.sulfur.dioxide: as total.sulfur.dioxide increases, free.sulfur.dioxide increases.
Quality is most strongly correlated with alcohol (r = 0.44) and density (r = -0.31). But alcohol and density are themselves strongly correlated (r = -0.78), so I will use only alcohol in the regression.
The highest quality has a narrower range of alcohol: [10.3, 12.8]. Holding alcohol constant, the highest quality has the lowest chlorides, and lower quality has higher chlorides.
Holding alcohol constant, higher volatile.acidity seems to go with lower quality.
Holding alcohol constant, it seems that higher total.sulfur.dioxide goes with higher quality.
I found a very interesting thing in the density plots: quality 9 has a bimodal distribution for alcohol, fixed.acidity, residual.sugar, free.sulfur.dioxide, density and volatile.acidity. My guess is that this is because the response variable quality is not a concrete quantity that can be measured exactly with some function or instrument. The quality score is quite subjective, and different tasters may apply different standards. This could lead to the bimodal distribution for the highest quality.
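One check that may help when reading per-group density plots is to look at how many observations sit behind each curve, since a kernel density estimate from a very small group can look bumpy by chance. Whether that applies to quality 9 here depends on the group size, which is not reported in this write-up. The sketch below uses synthetic stand-in data (not the wine data); the group sizes and the helper `count_modes` are made up for illustration.

```r
set.seed(3)
# Synthetic stand-in data (NOT the wine data): one large group, one tiny group
alcohol <- c(rnorm(500, mean = 10.5, sd = 1.2), rnorm(5, mean = 12, sd = 1))
quality <- c(rep(6, 500), rep(9, 5))

# How many observations sit behind each density curve?
group_n <- table(quality)
print(group_n)

# Kernel density estimate for each quality group
dens <- lapply(split(alcohol, quality), density)

# Count local maxima of a density curve as a rough mode count
count_modes <- function(d) sum(diff(sign(diff(d$y))) == -2)
print(sapply(dens, count_modes))
```

Running `table()` on the real quality column would show whether the quality 9 curves rest on enough wines to interpret their shape.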
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wq)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wq)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates,
## data = wq)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## residual.sugar, data = wq)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## residual.sugar + fixed.acidity, data = wq)
##
## ===========================================================================
## m1 m2 m3 m4 m5
## ---------------------------------------------------------------------------
## (Intercept) 2.582*** 3.017*** 2.803*** 2.112*** 2.672***
## (0.098) (0.098) (0.110) (0.125) (0.159)
## alcohol 0.313*** 0.324*** 0.325*** 0.376*** 0.371***
## (0.009) (0.009) (0.009) (0.010) (0.010)
## volatile.acidity -1.979*** -1.963*** -2.091*** -2.103***
## (0.110) (0.110) (0.109) (0.109)
## sulphates 0.416*** 0.453*** 0.443***
## (0.097) (0.095) (0.095)
## residual.sugar 0.027*** 0.028***
## (0.002) (0.002)
## fixed.acidity -0.073***
## (0.013)
## ---------------------------------------------------------------------------
## R-squared 0.2 0.2 0.2 0.3 0.3
## adj. R-squared 0.2 0.2 0.2 0.3 0.3
## sigma 0.8 0.8 0.8 0.8 0.8
## F 1146.4 773.9 523.9 434.1 355.9
## p 0.0 0.0 0.0 0.0 0.0
## Log-likelihood -5839.4 -5681.8 -5672.5 -5610.8 -5594.8
## Deviance 3112.3 2918.3 2907.3 2834.9 2816.5
## AIC 11684.8 11371.6 11355.0 11233.6 11203.7
## BIC 11704.3 11397.5 11387.5 11272.6 11249.2
## N 4898 4898 4898 4898 4898
## ===========================================================================
The variables in the largest linear model (m5) account for about 30% of the variance in wine quality.
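The comparison of nested models in the table above can be sketched with `lm()` and `update()`. The example runs on synthetic stand-in data (not the wine data), and the simulated coefficients are assumptions, so the R-squared values it prints are not those of the wine models.

```r
set.seed(4)
n <- 500
# Synthetic stand-in data (NOT the wine data)
alcohol          <- runif(n, 8, 14)
volatile.acidity <- runif(n, 0.1, 0.6)
quality <- 2.5 + 0.3 * alcohol - 2 * volatile.acidity + rnorm(n, sd = 0.8)
toy <- data.frame(quality, alcohol, volatile.acidity)

m1 <- lm(quality ~ alcohol, data = toy)
m2 <- update(m1, . ~ . + volatile.acidity)  # add one predictor to m1

# Share of variance explained by each model
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
print(round(r2, 3))

# Partial F-test: does the extra predictor improve the fit?
print(anova(m1, m2))
```

R-squared can never decrease when a predictor is added to a nested least-squares model, which is why adjusted R-squared, AIC and BIC are reported alongside it in the table.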
Alcohol influences the quality of white wine most strongly, and I can build a linear model of quality on alcohol. The other variables I investigated in this section have a relationship with quality, but not a very strong one.
In the density plots, quality 9 has a bimodal distribution for most of the features, such as alcohol, fixed.acidity, residual.sugar, free.sulfur.dioxide and volatile.acidity. My guess is that the response variable quality is not a concrete quantity that can be measured exactly with some function or instrument. The quality score is quite subjective, and different tasters may apply different standards. This could lead to the bimodal distribution for the highest quality.
Yes, I created a linear model. The variables in the linear model account for only 30% of the variance in wine quality. Although all the variables are significant, alcohol alone explains 20% of the variance in quality, and the other 4 variables together explain only another 10%. This model explains quality to some extent, but it is not a very good predictive model.
The question we are answering is which features influence the quality of white wine, so it is useful to start with the histogram of quality and get to know its distribution and summary statistics. The distribution of white wine quality appears to be normal. The mean quality is around 5.9. According to the boxplot below, the median quality is around 6. Most wines have a quality between 4 and 7.
As the analysis in the previous sections shows, white wine quality is most strongly related to alcohol. From the plot we can see that the highest quality has the highest median alcohol. As quality increases, the median alcohol increases.
While looking for relationships between quality and the other features, I found an interesting phenomenon. From this density plot we can see that quality 9 has a bimodal distribution for alcohol. This is also true for most of the other features: fixed.acidity, residual.sugar, free.sulfur.dioxide and volatile.acidity. My guess is that the response variable quality is not a concrete quantity that can be measured exactly with some function or instrument. The quality score is quite subjective, and different tasters may apply different standards. This could lead to the bimodal distribution for the highest quality.
The white wine data set contains information on 4898 white wines. I started by understanding the individual variables in the data set, and then explored the relationships among these variables as I continued to make observations on plots. Eventually, I explored the quality of white wine across many variables and created a linear model to predict white wine quality.
I assumed that the differences between adjacent quality levels are equal, so I treated quality as a continuous variable and calculated Pearson’s correlation. I found that quality correlates most strongly with alcohol. Quality also correlates with density, chlorides, total.sulfur.dioxide and volatile.acidity, but these correlations are weaker (|r| between 0.17 and 0.31). I found many correlations between the independent variables: alcohol correlates with density and residual.sugar; density correlates with residual.sugar and total.sulfur.dioxide; total.sulfur.dioxide correlates with free.sulfur.dioxide; pH correlates with fixed.acidity.
The linear model includes alcohol, volatile.acidity, sulphates, residual.sugar and fixed.acidity. This model explains only 30% of the variance in quality, and alcohol alone explains 20%. This is probably because treating quality as a continuous variable is not appropriate. Or maybe there are other factors not included in the data set that influence quality. A data set that is not large enough may also be a reason. I would be interested in performing ordinal regression, which is designed for an ordinal response variable.
I also found a very interesting thing in the plots. Quality 9 has a bimodal distribution for most of the features, such as alcohol, fixed.acidity, residual.sugar, free.sulfur.dioxide and volatile.acidity. This surprised me. My guess is that the response variable quality is not a concrete quantity that can be measured correctly and consistently with some function or other objective measurement. The quality score is quite subjective, and different tasters may apply different standards. This could lead to the bimodal distribution for the highest quality.
The main struggle in this project was the data type. The example we did in the lesson has a continuous response variable and various types of independent variables, so there was a lot to explore and many different figures to plot. In this data set, the response variable is ordinal, so I did not know what to do when I began. I spent some time learning the differences between categorical, ordinal, interval and ratio variables, and searching for regression methods for these variables. In the end I chose to assume that the difference between adjacent levels is equal and to treat quality as a continuous variable. I was also able to generate different kinds of plots by exploring and referring to some websites. The result might not be good enough, but I have learned a lot through this process.
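The ordinal regression mentioned above as a possible next step could be sketched with `polr()` from the MASS package, which fits a proportional-odds logistic model. This is an illustration under assumptions, not part of the analysis: the data are synthetic (not the wine data), the cut points and coefficients are made up, and a single predictor is used for brevity.

```r
library(MASS)  # provides polr(); a recommended package bundled with most R installations

set.seed(5)
n <- 600
# Synthetic stand-in data (NOT the wine data): an ordered quality score
# generated from a latent variable that increases with alcohol
alcohol <- runif(n, 8, 14)
latent  <- 0.8 * (alcohol - 11) + rlogis(n)
quality <- cut(latent, breaks = c(-Inf, -1.5, 0, 1.5, Inf),
               labels = c("4", "5", "6", "7"), ordered_result = TRUE)
toy <- data.frame(quality, alcohol)

# Proportional-odds logistic regression; the response must be an ordered factor
fit <- polr(quality ~ alcohol, data = toy, Hess = TRUE)
print(summary(fit))

# Predicted probability of each quality level at alcohol = 12
print(predict(fit, newdata = data.frame(alcohol = 12), type = "probs"))
```

Unlike the linear model, this approach does not assume that the steps between adjacent quality levels are equal; it estimates one threshold between each pair of adjacent levels.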